Fast production of large-area patterns with nanometre resolution is crucial for the established semiconductor industry and for enabling industrial-scale production of next-generation quantum devices. Metastable-atom lithography with binary holographic masks has been suggested as a higher-resolution, lower-cost alternative to the current state of the art: extreme ultraviolet (EUV) lithography. However, it was recently shown that the interaction of the metastable atoms with the mask material (SiN) leads to strong perturbations of the wavefront that are not described by classical scalar waves. This means that the inverse problem (creating a mask from a desired pattern) cannot be solved analytically, even in 1D. Here we present a machine learning approach to mask generation targeted at metastable atoms. Our algorithm combines genetic optimization and deep learning to obtain the mask: a novel deep neural architecture is trained to produce an initial approximation of the mask, and this approximation is then used to generate the initial population of a genetic optimization algorithm that can converge to arbitrary precision. We demonstrate the generation of arbitrary 1D patterns for system dimensions within the Fraunhofer approximation limit.
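A minimal Python/NumPy sketch of the two-stage idea described above: a network-predicted initial mask (assumed to come from some trained model, not shown) seeds the population of a simple genetic algorithm whose fitness compares a simulated far-field pattern to the target. The scalar Fraunhofer forward model used here is only a stand-in; as the abstract notes, the real atom-mask interaction is not captured by classical scalar waves.

```python
import numpy as np

def scalar_fraunhofer(mask):
    """Stand-in forward model: far-field intensity of a 1D binary mask
    under the scalar Fraunhofer approximation (|FFT|^2)."""
    intensity = np.abs(np.fft.fftshift(np.fft.fft(mask))) ** 2
    return intensity / (intensity.max() + 1e-12)

def genetic_refine(target, init_mask, pop_size=64, generations=500,
                   mutation_rate=0.01, rng=None):
    """Refine a binary mask with a simple genetic algorithm seeded by the
    network's initial approximation (hypothetical interface)."""
    rng = np.random.default_rng() if rng is None else rng
    n = init_mask.size
    # Seed the population with randomly perturbed copies of the initial mask.
    pop = np.array([np.where(rng.random(n) < mutation_rate, 1 - init_mask, init_mask)
                    for _ in range(pop_size)])

    def fitness(mask):
        return -np.mean((scalar_fraunhofer(mask) - target) ** 2)

    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]            # selection
        cuts = rng.integers(1, n, size=pop_size // 2)
        children = np.array([np.concatenate([parents[i % len(parents)][:c],
                                             parents[(i + 1) % len(parents)][c:]])
                             for i, c in enumerate(cuts)])            # crossover
        flips = rng.random(children.shape) < mutation_rate            # mutation
        pop = np.concatenate([parents, np.where(flips, 1 - children, children)])
    return pop[np.argmax([fitness(m) for m in pop])]
```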
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as gains on long-tail object queries, and the ability to perform zero-shot and few-shot NLQ.
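To make the narrations-as-queries idea concrete, here is a minimal, hypothetical Python sketch: each timestamped narration becomes a pseudo natural-language query paired with a short temporal window around its timestamp. The field names and the fixed window size are illustrative assumptions, not the paper's exact conversion recipe (which may, for example, derive windows from neighbouring narrations).

```python
from dataclasses import dataclass

@dataclass
class NLQSample:
    video_id: str
    query: str
    start_s: float
    end_s: float

def narrations_to_queries(narrations, window_s=2.0):
    """Convert timestamped narrations into NLQ-style training samples.
    Each narration dict is assumed to look like
    {"video_id": ..., "text": ..., "timestamp_s": ...}."""
    samples = []
    for n in narrations:
        t = n["timestamp_s"]
        samples.append(NLQSample(
            video_id=n["video_id"],
            query=n["text"],                      # e.g. "#C C opens the fridge"
            start_s=max(0.0, t - window_s / 2),   # short window around the timestamp
            end_s=t + window_s / 2,
        ))
    return samples
```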
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation in the space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.
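A minimal sketch of this flipped design, under assumed interfaces: each frozen task-specific backbone maps the shared video input to a token sequence of a common dimension, and a small shared transformer "translator" fuses all task tokens before a head re-predicts one task's output. The module shapes, pooling, and head are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TaskTranslatorSketch(nn.Module):
    """Frozen per-task backbones + one shared translator (assumed interfaces)."""
    def __init__(self, backbones, token_dim, num_classes):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)   # pretrained, kept frozen
        for b in self.backbones:
            for p in b.parameters():
                p.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=2)  # shared across tasks
        self.head = nn.Linear(token_dim, num_classes)                 # primary-task head

    def forward(self, video):
        # Each backbone is assumed to map the shared input to (B, T_i, token_dim) tokens.
        tokens = torch.cat([b(video) for b in self.backbones], dim=1)
        fused = self.translator(tokens)             # captures cross-task synergies
        return self.head(fused.mean(dim=1))
```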
While object reconstruction has made great strides in recent years, current methods typically require densely captured images and/or known camera poses, and generalize poorly to novel object categories. To step toward object reconstruction in the wild, this work explores reconstructing general real-world objects from a few images without known camera poses or object categories. The crux of our work is solving two fundamental 3D vision problems -- shape reconstruction and pose estimation -- in a unified approach. Our approach captures the synergies of these two problems: reliable camera pose estimation gives rise to accurate shape reconstruction, and the accurate reconstruction, in turn, induces robust correspondence between different views and facilitates pose estimation. Our method FORGE predicts 3D features from each view and leverages them in conjunction with the input images to establish cross-view correspondence for estimating relative camera poses. The 3D features are then transformed by the estimated poses into a shared space and are fused into a neural radiance field. The reconstruction results are rendered by volume rendering techniques, enabling us to train the model without 3D shape ground-truth. Our experiments show that FORGE reliably reconstructs objects from five views. Our pose estimation method outperforms existing ones by a large margin. The reconstruction results under predicted poses are comparable to the ones using ground-truth poses. The performance on novel testing categories matches the results on categories seen during training. Project page: https://ut-austin-rpl.github.io/FORGE/
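As a loose, heavily simplified sketch of the data flow described above (per-view features, relative-pose regression, fusion, radiance decoding), the Python/PyTorch snippet below uses assumed module shapes throughout; it omits volume rendering and, unlike the actual method, does not warp features by the estimated poses before fusion.

```python
import torch
import torch.nn as nn

class FewViewPipelineSketch(nn.Module):
    """Illustrative pipeline: lift per-view features, regress relative poses,
    fuse into a shared volume, decode density/colour per voxel."""
    def __init__(self, feat_dim=32, vox=16):
        super().__init__()
        self.feat_dim, self.vox = feat_dim, vox
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim * vox, 4, stride=4))
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.ReLU(), nn.Linear(128, 6))  # 6-DoF
        self.radiance_mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))        # density + rgb

    def forward(self, views):                       # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        f2d = self.encoder(views.flatten(0, 1))     # (B*V, feat_dim*vox, h, w)
        h, w = f2d.shape[-2:]
        f3d = f2d.view(B, V, self.feat_dim, self.vox, h, w)   # lifted 3D features
        pooled = f3d.mean(dim=(-1, -2, -3))         # (B, V, feat_dim) per-view summary
        # Relative pose of each view w.r.t. view 0, from concatenated summaries.
        rel_poses = self.pose_head(
            torch.cat([pooled[:, :1].expand(-1, V, -1), pooled], dim=-1))
        fused = f3d.mean(dim=1)                     # naive fusion into a shared volume
        radiance = self.radiance_mlp(fused.permute(0, 2, 3, 4, 1))
        return rel_poses, radiance
```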
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
Deception is rapidly gaining traction as an important tool for cyber defence, complementing existing perimeter security measures to rapidly detect breaches and data theft. One of the factors limiting the use of deception is the cost of hand-crafting realistic artefacts. However, recent advances in machine learning have created opportunities for scalable, automated generation of realistic deceptions. This vision paper describes the opportunities and challenges involved in developing models that mimic many common elements of the IT stack for deception purposes.
First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is directly seen. We present an approach that links egocentric video and the camera's pose over time by learning representations predictive of the camera-wearer's (potentially unseen) local surroundings, thereby facilitating human-centric environment understanding. We train using videos from agents in simulated 3D environments, where the environment is fully observable, and test on real videos of house tours in unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room the camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: http://vision.cs.utexas.edu/projects/ego-scene-context/
We introduce SoundSpaces 2.0, a platform for geometry-based audio rendering in 3D environments. Given the 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports a range of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared with existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To our knowledge, it is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate two downstream tasks, embodied navigation and far-field automatic speech recognition, and highlight sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research into perceptual systems that can both see and hear.
Previous work has largely considered the fairness of image captioning systems through the underspecified lens of "bias". In contrast, we provide a set of techniques for measuring five types of representational harms, along with the resulting measurements obtained using the most popular image captioning datasets. Our goal is not to audit this image captioning system, but rather to develop normative measurement techniques, which in turn provide an opportunity to reflect on the many challenges involved. We propose multiple measurement techniques for each type of harm. We argue that doing so better captures the multi-faceted nature of each harm, thereby improving the (collective) validity of the resulting measurements. Throughout, we discuss the assumptions underlying our measurement approaches and point out the assumptions they do not make.
Historically, patient datasets have been used to develop and validate various reconstruction algorithms for PET/MRI and PET/CT. To enable such algorithm development without the need to acquire hundreds of patient exams, in this paper we demonstrate a deep learning technique for generating synthetic but realistic whole-body PET sinograms from abundantly available whole-body MRI. Specifically, we trained a 3D residual UNet on a dataset of 56 $^{18}$F-FDG-PET/MRI exams to predict physiological PET uptake from whole-body T1-weighted MRI. During training we implemented a balanced loss function to generate realistic uptake across a large dynamic range, and computed losses along tomographic lines of response to mimic the PET acquisition. The predicted PET images are forward projected to produce synthetic PET time-of-flight (TOF) sinograms that can be used with vendor-provided PET reconstruction algorithms, including those using CT-based attenuation correction (CTAC) and MR-based attenuation correction (MRAC). The resulting synthetic data recapitulate physiological $^{18}$F-FDG uptake, for example high uptake localized to the brain and bladder, as well as uptake in the liver, kidneys, heart, and muscle. To simulate abnormalities with high uptake, we also insert synthetic lesions. We demonstrate that this synthetic PET data can be used interchangeably with real PET data for the PET quantification task of comparing CT- and MR-based attenuation correction methods, achieving $\leq 7.6\%$ error in mean values compared with using real data. Together, these results show that the proposed synthetic PET data pipeline can reasonably be used for the development, evaluation, and validation of PET/MRI reconstruction methods.
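As a rough illustration of the two loss ingredients mentioned above, here is a hedged Python/PyTorch sketch: an intensity loss that up-weights high-uptake voxels to cover the large dynamic range, plus a loss on simple parallel projections as a lightweight stand-in for losses along tomographic lines of response. The threshold, weights, and projection geometry are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def projection_loss(pred_pet, true_pet):
    """Compare coarse parallel-beam projections (sums along each in-plane axis)
    as a stand-in for losses along tomographic lines of response."""
    loss = 0.0
    for dim in (-1, -2):
        loss = loss + F.l1_loss(pred_pet.sum(dim=dim), true_pet.sum(dim=dim))
    return loss

def balanced_intensity_loss(pred_pet, true_pet, high_uptake_thresh=2.0, w_high=5.0):
    """Hypothetical 'balanced' loss: up-weight high-uptake voxels so the network
    reproduces realistic values across the large dynamic range."""
    weights = torch.where(true_pet > high_uptake_thresh,
                          torch.full_like(true_pet, w_high),
                          torch.ones_like(true_pet))
    return (weights * (pred_pet - true_pet).abs()).mean()

def total_loss(pred_pet, true_pet, lam=0.1):
    # Combined objective used to train the MRI-to-PET network (weights assumed).
    return balanced_intensity_loss(pred_pet, true_pet) + lam * projection_loss(pred_pet, true_pet)
```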